Data

Available data set and its variables:

## Observations: 1,012
## Variables: 51
## $ max_bid                     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ min_bid                     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ primary_tracking_source     <fct> fr8hub, fr8hub, fr8hub, fr8hub, fr...
## $ no_bids_refused             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fh_commission               <dbl> 875.00, 625.00, 1075.00, 600.00, 5...
## $ shipper_ask_price           <dbl> 4375, 3125, 5375, 4000, 2625, 1500...
## $ distance                    <dbl> 1780.3230, 1517.9297, 2177.3311, 1...
## $ duration                    <int> 97020, 95100, 136440, 102360, 8652...
## $ delivery_scheduled_until    <fct> 2017-06-22 15:00:00, 2017-07-10 19...
## $ delivery_scheduled_at_month <fct> 2017-06-01 0:00:00, 2017-07-01 0:0...
## $ delivery_scheduled_at       <fct> 2017-06-22 12:00:00, 2017-07-10 12...
## $ pickup_scheduled_at         <fct> 2017-06-19 18:00:00, 2017-07-07 18...
## $ is_hazmat                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ carrier_closed_price        <dbl> 3500.000, 2500.000, 4300.000, 2687...
## $ load_description            <fct> consumer goods, pipes , building m...
## $ load_value                  <dbl> 100000, 100000, 100000, 100000, 10...
## $ load_equipment              <fct> Dry Van, Dry Van, Flatbed, Flatbed...
## $ destination_point           <fct> 38.8645105, -76.7279378, 41.006085...
## $ destination_point_lat       <dbl> 38.86451, 41.00608, 47.68264, 41.3...
## $ destination_point_long      <dbl> -76.72794, -83.64345, -117.40863, ...
## $ destination_postal_code     <int> 20774, 45840, 99207, 44139, 78045,...
## $ destination_state           <fct> Maryland, Ohio, Washington, Ohio, ...
## $ destination_city            <fct> Upper Marlboro, Findlay, Spokane, ...
## $ destination_country         <fct> United States, United States, Unit...
## $ origin_point                <fct> 27.617417, -99.523012, 27.5996843,...
## $ origin_point_lat            <dbl> 27.61742, 27.59968, 27.61715, 27.6...
## $ origin_point_long           <dbl> -99.52301, -99.49736, -99.47522, -...
## $ origin_postal_code          <int> 78045, 78045, 78045, 78045, 60618,...
## $ origin_state                <fct> Texas, Texas, Texas, Texas, Illino...
## $ origin_city                 <fct> Laredo, Laredo, Laredo, Laredo, Ch...
## $ origin_country              <fct> United States, United States, Unit...
## $ carrier_name                <fct> Falcon Transport Inc, Falcon Trans...
## $ shipper_closed_price        <dbl> 4375.000, 3125.000, 5375.000, 3287...
## $ status                      <fct> completed, completed, completed, c...
## $ is_multistop                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ id                          <fct> 55ade5c6-5512-11e7-9d6d-0a580a0005...
## $ shipper_id                  <fct> 95ef81e4-3f50-11e7-8ffc-0a580a0001...
## $ shipment_no                 <int> 115, 140, 141, 159, 133, 294, 251,...
## $ matched_at                  <fct> 2017-06-19 17:17:32, 2017-07-07 16...
## $ is_cross_border             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ carrier_id                  <fct> 0e38bd80-3efe-11e7-8a0f-0a580a0001...
## $ is_completed                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
## $ is_dropped                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ unloaded_at                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ posted_at                   <fct> 2017-06-19 17:11:30, 2017-07-07 16...
## $ is_halted                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ shipper_name                <fct> Ventus Freight LLC, Ventus Freight...
## $ carrier_winning_bid         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ diffhourS                   <dbl> 3, 7, 9, 0, 2, 2, 9, 0, 4, 10, 9, ...
## $ diffhourSP                  <dbl> 66.0, 66.0, 92.0, 116.0, 90.0, 40....
## $ diffhourMP                  <dbl> 0.100555556, 0.047777778, 0.035000...

Note that three new variables have been created:

  1. \(diffhourS = delivery\_scheduled\_until - delivery\_scheduled\_at\),
  2. \(diffhourSP = delivery\_scheduled\_at - pickup\_scheduled\_at\) and
  3. \(diffhourMP = matched\_at - posted\_at\).

calculating time differences in hours.
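The code that produced these variables is not shown, so the following is only a minimal sketch of how such differences could be computed in base R. It uses the first row of the listing above; the helper names `to_time` and `hours_between` and the UTC time zone are assumptions made for illustration.

```r
# First observation, copied from the listing above (timestamps are stored as factors).
first_row <- data.frame(
  delivery_scheduled_until = factor("2017-06-22 15:00:00"),
  delivery_scheduled_at    = factor("2017-06-22 12:00:00"),
  pickup_scheduled_at      = factor("2017-06-19 18:00:00"),
  matched_at               = factor("2017-06-19 17:17:32"),
  posted_at                = factor("2017-06-19 17:11:30")
)

# Hypothetical helper: factor/character timestamp -> POSIXct (time zone is an assumption).
to_time <- function(x) {
  as.POSIXct(as.character(x), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# Hypothetical helper: difference between two timestamp columns, in hours.
hours_between <- function(later, earlier) {
  as.numeric(difftime(to_time(later), to_time(earlier), units = "hours"))
}

diffhourS  <- hours_between(first_row$delivery_scheduled_until, first_row$delivery_scheduled_at)
diffhourSP <- hours_between(first_row$delivery_scheduled_at, first_row$pickup_scheduled_at)
diffhourMP <- hours_between(first_row$matched_at, first_row$posted_at)

print(c(diffhourS = diffhourS, diffhourSP = diffhourSP, diffhourMP = diffhourMP))
```

For this row the three values are 3, 66 and about 0.1006 hours, which agrees with the first entries of diffhourS, diffhourSP and diffhourMP in the listing.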

Looking at the available variables we assume that fh_commission is a key variable of interest and is derived by \[fh\_commission = shipper\_closed\_price - carrier\_closed\_price\] We will start by examining the response variable fh_commission, looking at its distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1750.0   100.0   200.0   247.5   388.5  1350.0

There is a long ‘tail’ to the left caused by an extremely negative value. The majority of the observations are centered around the value of \(200\) and data is slightly right skewed.

The questions that need addressing:

  • is it ok to expect negative fh_commissions?
  • is it possible to have extreme negative values or should they be treated as outliers?

We should identify the extreme observations:

which(mydata$fh_commission < -500)
## [1] 203
which(mydata$fh_commission > 1000)
## [1]   3 217 311 315

Since this variable is a linear combination of shipper_closed_price and carrier_closed_price it would be useful to examine the correlation between those three variables.

If we take a look once again, but this time without the extreme negative fh_commission:

we will notice a very strong relationship between shipper_closed_price and carrier_closed_price. Should we model them? If we stick with fh_commission as our key variable of interest, would it be appropriate to use shipper_closed_price and carrier_closed_price as explanatory variables in the predictive model? No: at prediction time both prices would be unknown, and since fh_commission is their difference we would in effect have to predict them first.

For that reason shipper_closed_price and carrier_closed_price will not be considered for our model.
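As a quick illustration of why the two closed prices cannot serve as covariates, the sketch below checks the assumed identity on the first three rows of the listing above (later rows are truncated in the listing and are not used). The object names are made up for the example.

```r
# First three rows of the price columns, copied from the listing above.
prices <- data.frame(
  fh_commission        = c(875, 625, 1075),
  shipper_closed_price = c(4375, 3125, 5375),
  carrier_closed_price = c(3500, 2500, 4300)
)

# The commission implied by the two closed prices.
implied <- prices$shipper_closed_price - prices$carrier_closed_price

# TRUE if the implied commission reproduces fh_commission exactly.
identity_holds <- isTRUE(all.equal(implied, prices$fh_commission))
print(identity_holds)
```

If the identity holds across the data, a model containing both closed prices would simply reproduce the response rather than predict it.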

Bivariate analysis

Let us look at bivariate relationships between response variable fh_commission and potential explanatory variables.

1. shipper_ask_price

fh_commission vs shipper_ask_price: M v M

To explore the relationship between measured type variables we fit a regression model for a given response y and the explanatory variable x: \[ y = b_0 + b_1x + e, \] where e is the error term (the part of the variability in y that is not explained by the fitted model, i.e. by the explanatory variable(s)).

First we’ll look at the summary of the explanatory variable shipper_ask_price.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      90    1600    2572    2680    3125  100000

It would be useful to identify the observations with very high shipper_ask_price above 20K

## [1]  17 106 524 550 573 578 754

and look at the spread of the data once again without the observations above this threshold.

The key question is how important the shipper_ask_price variable is in explaining the variability in fh_commission. To analyse this we fit a simple regression model: \[fh\_commission = b_0 + b_1shipper\_ask\_price\]

## 
## Call:
## lm(formula = fh_commission ~ shipper_ask_price, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2017.7  -146.9   -47.6   125.7  1076.1 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.301e+02  8.466e+00  27.175  < 2e-16 ***
## shipper_ask_price 6.491e-03  1.741e-03   3.728 0.000204 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224.7 on 1010 degrees of freedom
## Multiple R-squared:  0.01358,    Adjusted R-squared:  0.0126 
## F-statistic:  13.9 on 1 and 1010 DF,  p-value: 0.0002035

It appears to be a statistically significant relationship, even though the model accounts for only \(1.36\%\) of variability in fh_commission (\(R^2=1.36\%\)).
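The following is a sketch, on simulated data, of how a simple regression like the one above can be fitted with and without the observations above the 20K threshold. The data frame `toy`, its values and the helper `fit_summary` are assumptions for illustration only; they are not the real data or the original code.

```r
set.seed(1)

# Simulated stand-in for the real data (all values are made up).
n <- 200
toy <- data.frame(shipper_ask_price = runif(n, 500, 5000))
toy$fh_commission <- 230 + 0.0065 * toy$shipper_ask_price + rnorm(n, sd = 220)
toy$shipper_ask_price[1] <- 100000  # one extreme ask price, mimicking the real maximum

# Fit on all observations.
fit_all <- lm(fh_commission ~ shipper_ask_price, data = toy)

# Refit without the observations above the 20K threshold used in the text.
fit_trim <- lm(fh_commission ~ shipper_ask_price, data = toy,
               subset = shipper_ask_price <= 20000)

# Hypothetical helper: slope, its p-value and R^2 of a simple regression.
fit_summary <- function(fit) {
  s <- summary(fit)
  c(slope     = unname(coef(fit)[2]),
    p_value   = unname(s$coefficients[2, "Pr(>|t|)"]),
    r_squared = s$r.squared)
}

print(rbind(all = fit_summary(fit_all), trimmed = fit_summary(fit_trim)))
```

Comparing the two rows shows how much a single extreme ask price can pull the slope and \(R^2\), which is worth checking before interpreting the fit on the full data.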

2. distance

fh_commission vs distance: M v M

We will do the same analysis as above:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.41  766.21 1228.19 1153.39 1506.17 2545.01

Distribution of distance doesn’t present any issues.

## 
## Call:
## lm(formula = fh_commission ~ distance, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2206.26  -130.93   -29.44   112.90   964.07 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.80971   16.68203   4.125 4.02e-05 ***
## distance     0.15490    0.01325  11.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 212.4 on 1010 degrees of freedom
## Multiple R-squared:  0.1191, Adjusted R-squared:  0.1182 
## F-statistic: 136.6 on 1 and 1010 DF,  p-value: < 2.2e-16

This is a statistically significant relationship with \(R^2=11.91\%\).

3. duration

fh_commission vs duration: M v M

We expect to get very similar outcomes as for the previous analysis of fh_commission vs distance, since duration and distance are highly correlated.

cor(mydata$distance, mydata$duration)
## [1] 0.9894508

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     720   53625   77340   75253   96135  186360

Distribution of duration doesn’t present any issues.

## 
## Call:
## lm(formula = fh_commission ~ duration, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2182.92  -130.42   -29.52   114.82  1020.67 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.516e+01  1.735e+01   4.332 1.62e-05 ***
## duration    2.290e-03  2.124e-04  10.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214.3 on 1010 degrees of freedom
## Multiple R-squared:  0.1032, Adjusted R-squared:  0.1023 
## F-statistic: 116.2 on 1 and 1010 DF,  p-value: < 2.2e-16

This is a statistically significant relationship with \(R^2=10.32\%\).

4. delivery_scheduled_until: month, day

Activity by delivery_scheduled_until slows down over the summer period and does not show any clear pattern across the days of the month.

5. delivery_scheduled_at: month, day

We’ll do the same analysis of delivery_scheduled_at.

We see very similar results as for delivery_scheduled_until.

6. pickup_scheduled_at: month, day

Monthly figures are almost identical, with some slight deviations for days without any significant pattern.
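The month and day summaries for the three scheduling variables are not accompanied by code, so here is a minimal sketch of how month and day-of-month could be extracted from a timestamp stored as a factor. The six timestamps and their pairing with commissions are made up for illustration.

```r
# Toy data (made up): timestamps stored as factors, as in the real data.
toy <- data.frame(
  pickup_scheduled_at = factor(c("2017-06-19 18:00:00", "2017-06-27 09:00:00",
                                 "2017-07-07 18:00:00", "2017-07-21 08:30:00",
                                 "2017-08-02 14:00:00", "2017-08-15 07:00:00")),
  fh_commission = c(875, 300, 625, 200, 150, 400)
)

# Parse the factor into a date-time (the UTC time zone is an assumption).
when <- as.POSIXct(as.character(toy$pickup_scheduled_at),
                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

toy$month <- as.integer(format(when, "%m"))  # calendar month, 1-12
toy$day   <- as.integer(format(when, "%d"))  # day of the month, 1-31

# Number of loads and mean commission per month.
loads_per_month <- table(toy$month)
mean_per_month  <- tapply(toy$fh_commission, toy$month, mean)

print(loads_per_month)
print(mean_per_month)
```

The same two derived columns can be plotted against fh_commission or counted per level to look for seasonal or within-month patterns.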

7. is_hazmat

fh_commission vs is_hazmat: M v A(2)

This is a ‘measured vs. attribute’ type of problem. We split the measured variable into subgroups according to the levels of the attribute variable to assess the similarity of the subdistributions. The question we want to answer is: Are the subdistributions the same or different?

Graphical visualisation of the two subdistributions might help to answer this question! Boxplots are the standard graphical method of displaying this sort of data. The boxplot display enables the difference in ‘means’ to be assessed relative to the spreads (variability) of the subdistributions.

There are only a few

## [1] 6

observations with is_hazmat = TRUE, so this variable would not be suitable for modelling. Nonetheless, it is interesting to notice that all the fh_commission values for is_hazmat = TRUE are around their average value:

## # A tibble: 2 x 2
##   `as.factor(is_hazmat)` mean_fh
##   <fct>                    <dbl>
## 1 FALSE                     247.
## 2 TRUE                      405.
## [1] 300.00 400.00 353.25 566.25 400.00 408.00
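As a small worked check, the group mean reported for is_hazmat = TRUE can be recomputed from the six values printed above. The variable names below are made up for the example.

```r
# The six fh_commission values for is_hazmat = TRUE, copied from the output above.
hazmat_commission <- c(300.00, 400.00, 353.25, 566.25, 400.00, 408.00)

hazmat_mean  <- mean(hazmat_commission)   # average commission in the TRUE group
hazmat_range <- range(hazmat_commission)  # smallest and largest value

print(round(hazmat_mean, 2))
print(hazmat_range)
```

The mean is about 404.58, consistent with the rounded 405 in the table, and all six values lie between 300 and 566.25.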

8. load_description

fh_commission vs load_description: M v A(>2)

Let us look at the summary of load_description, but before we do that we will first check the number of possible outcomes:

## [1] 187

There are \(187\) levels altogether, so we will look only at the 21 most frequent outcomes to get a feel for the information the variable provides:

##       Polymer Beads         Sheet Metal            Aluminum 
##                 392                  53                  42 
##      Polymer Beads                 Wire               CHIPS 
##                  40                  26                  23 
##   Large Paper Rolls      polymer beads               carton 
##                  19                  18                  15 
##        Polymerbeads               Crane             Plastic 
##                  15                  13                  13 
##       polymer beads       Polymer beads                Beer 
##                  12                  12                  11 
##      consumer goods    Small Appliances        Steel Plates 
##                  11                   9                   9 
## SWIG ITEM KS TOWEL;           Autoparts               steel 
##                   9                   7                   7

It looks very messy! 😮 This variable would definitely need to be tidied up before it could be considered for any analysis. Time is needed for organising the data into a suitable format and for acquiring the skills required to use ‘regular expressions’ 😬. This would involve getting rid of punctuation symbols and developing consistency in the use of singular and plural forms, spaces, capital letters etc. For example, we should aim to get something like this, but tidier and better:

##     POLYMERBEADS       SHEETMETAL         ALUMINUM            CHIPS 
##              495               53               47               26 
##             WIRE  LARGEPAPERROLLS           CARTON             BEER 
##               26               20               18               16 
##            STEEL    CONSUMERGOODS            CRANE          PLASTIC 
##               14               13               13               13 
##          PEANUTS  SMALLAPPLIANCES      STEELPLATES  SWIGITEMKSTOWEL 
##                9                9                9                9 
##        AUTOPARTS     MACHINEPARTS SWINGITEMKSTOWEL       FIBERGLASS 
##                8                6                6                5
## [1] 152

This does look better, but it still has over \(150\) levels. This variable would need some considerable attention before it could be deemed suitable for modelling.
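The tidying step itself is not shown. A minimal sketch of one way to get labels of this form is to upper-case each description and strip everything that is not a letter. The sample labels (including the trailing spaces) and the helper `clean_label` are assumptions for illustration.

```r
# A few raw labels in the style of the table above (trailing spaces are assumed).
raw <- c("Polymer Beads", "Polymer Beads ", "polymer beads", "Polymerbeads",
         "Polymer beads", "Sheet Metal", "SWIG ITEM KS TOWEL;", "steel", "Steel")

# Hypothetical helper: upper-case everything, then drop every character that is not A-Z.
clean_label <- function(x) gsub("[^A-Z]", "", toupper(x))

cleaned <- clean_label(raw)

print(sort(table(cleaned), decreasing = TRUE))
print(length(unique(cleaned)))
```

A rule this simple merges spelling variants that differ only in case, spacing or punctuation, but it cannot reconcile singular and plural forms or typing errors, which is one reason so many levels remain.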

9. load_value

fh_commission vs load_value M v M

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10   45000   50000   61114   80000  600001

Let us see the spread without the ‘outliers’ identified on the boxplot at the positions

## [1] 168 347
## [1] 600001 150000

## 
## Call:
## lm(formula = fh_commission ~ load_value, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2001.00  -148.92   -48.92   140.96  1107.59 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.554e+02  1.517e+01  16.835   <2e-16 ***
## load_value  -1.302e-04  2.193e-04  -0.593    0.553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 226.2 on 1010 degrees of freedom
## Multiple R-squared:  0.0003486,  Adjusted R-squared:  -0.0006411 
## F-statistic: 0.3522 on 1 and 1010 DF,  p-value: 0.553

The relationship is not statistically significant with \(R^2=0.03\%\) and \(p=0.55\).

10. load_equipment

fh_commission vs load_equipment: M vs A(>2)

The informal part of the analysis for a measured response against an attribute explanatory variable with more than two levels is exactly the same as for the situation where the explanatory variable has exactly two levels; the interpretation is a little more difficult due to the larger number of levels, but the principles are exactly the same.

To detect a connection/link between the response variable and the explanatory variable requires a clear definition of exactly what a connection/link is.

The formal definition of no link for ‘M vs A’ is:

  • If the average value of the response variable is independent of the level of the attribute explanatory variable then the response variable and the attribute explanatory variable are independent (not connected).

Conversely:

  • If the average value of the response variable is dependent on the level of the attribute explanatory variable then the attribute explanatory variable influences the value of the response variable, so the response variable and the attribute explanatory variable are connected.

An examination of the means and the comparative boxplots will yield one of three possible decisions:

  • On the sample evidence there is clear evidence of no connection.
  • On the sample evidence there is clear evidence of a connection.
  • The sample evidence is inconclusive and further, more formal data analysis is required.

If further data analysis is required then it takes the form of a hypothesis test. The Measured v Attribute situation requires two different hypothesis tests. Where the attribute explanatory variable has exactly two levels, a t-test is used. A different hypothesis test is required where the attribute explanatory variable has three or more levels: formal data analysis for ‘M v A(>2)’ is known as One-Way Analysis of Variance (often abbreviated as one-way ANOVA), for which we use the F-test.

The problem is exactly the same as in the two-level attribute situation: is the difference between the means large enough to suggest that there is a real difference, or could it have occurred by pure chance (that is, is the difference within the limits of sampling error)? If there is no connection then by definition all the true means will be the same, whilst if there is a connection then the means are likely to be different.

Let us obtain the summary of load_equipment:

## Dry Van Flatbed  Reefer 
##     937      65      10

It appears that data is not balanced as there are only a few observations in Reefer category. Let us look at the boxplots to examine the relationship.

Boxplots for the Flatbed and Dry Van categories are overlapping, and since there are only a few observations in Reefer, with almost all values around the median, it is hard to draw a clear conclusion about the relationship between the two variables fh_commission and load_equipment. Although one-way ANOVA might not be the most appropriate further analysis considering the ‘shapes’ of the subdistributions, it could still provide us with an insight that could help us in identifying a possible relationship. We will also perform a nonparametric Kruskal-Wallis test.

##                  Df   Sum Sq Mean Sq F value   Pr(>F)    
## load_equipment    2  1831610  915805   18.53 1.25e-08 ***
## Residuals      1009 49877788   49433                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Kruskal-Wallis rank sum test
## 
## data:  fh_commission by load_equipment
## Kruskal-Wallis chi-squared = 14.428, df = 2, p-value = 0.0007363

For both tests the \(p\) values are below \(0.1\%\) (\(p = 1.25 \times 10^{-8}\) for the ANOVA F-test and \(p \approx 0.0007\) for Kruskal-Wallis), suggesting that on the sample evidence this is a statistically significant relationship.
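For reference, the two tests can be run in base R with `aov()` and `kruskal.test()`. The sketch below uses simulated commissions with the same group sizes as load_equipment; the group means, the standard deviation and the data frame `toy` are made up, so the printed numbers will not match the output above.

```r
set.seed(3)

# Simulated stand-in with the group sizes reported above (commission values are made up).
toy <- data.frame(
  load_equipment = factor(rep(c("Dry Van", "Flatbed", "Reefer"), times = c(937, 65, 10)))
)
group_mean <- c("Dry Van" = 240, "Flatbed" = 380, "Reefer" = 250)
toy$fh_commission <- rnorm(nrow(toy),
                           mean = group_mean[as.character(toy$load_equipment)],
                           sd = 220)

# Informal step: group sizes and group means.
print(table(toy$load_equipment))
print(tapply(toy$fh_commission, toy$load_equipment, mean))

# Formal step 1: one-way ANOVA (F-test on the equality of the group means).
anova_fit <- aov(fh_commission ~ load_equipment, data = toy)
print(summary(anova_fit))

# Formal step 2: Kruskal-Wallis rank sum test (no normality assumption).
kw <- kruskal.test(fh_commission ~ load_equipment, data = toy)
print(kw)
```

Running both is a reasonable cross-check here: ANOVA is sensitive to the unequal spreads and the single extreme commission, while the rank-based test is not.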

11. destination_state

fh_commission vs destination_state: M v A(>2)

##              Alabama              Arizona             Arkansas 
##                   23                   78                   76 
##           California     Ciudad de México Coahuila de Zaragoza 
##                  151                    6                   14 
##             Colorado          Connecticut     Estado de México 
##                   32                    4                    8 
##              Florida              Georgia           Guanajuato 
##                    5                   18                    2 
##                Idaho             Illinois              Indiana 
##                    1                   16                   17 
##                 Iowa               Kansas             Kentucky 
##                   19                   11                   14 
##            Louisiana             Maryland        Massachusetts 
##                    4                    2                    2 
##             Michigan            Minnesota          Mississippi 
##                    8                   13                    6 
##             Missouri              Montana             Nebraska 
##                    6                   11                   24 
##               Nevada           New Jersey           New Mexico 
##                    4                    3                    5 
##             New York       North Carolina           Nuevo León 
##                    1                   14                    8 
##                 Ohio             Oklahoma               Oregon 
##                   62                   25                    5 
##         Pennsylvania      San Luis Potosí       South Carolina 
##                    8                    4                   16 
##           Tamaulipas            Tennessee                Texas 
##                    8                   15                  211 
##                 Utah             Virginia           Washington 
##                   21                   15                    5 
##        West Virginia            Wisconsin 
##                    1                   10

There are almost \(50\) different levels for destination_state. Categorical data described through a large number of distinct values poses a serious challenge for regression algorithms which require numerical inputs. If we decide to use this variable in the predictive model it would be good to consider using target encoding, also known as impact encoding. This technique is explained in Daniele Micci-Barreca’s paper A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems.
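To make the idea concrete, here is a minimal sketch of smoothed target encoding: each level is replaced by a blend of its own mean response and the global mean, with more weight on the level mean when the level has many observations. The logistic weight and the tuning values `k` and `f` are assumptions chosen for this sketch, and the function name `target_encode` and the toy data are made up.

```r
# Smoothed target (impact) encoding: blend each level's mean with the global mean.
# k and f are illustrative tuning values, not taken from the analysis.
target_encode <- function(level, y, k = 2, f = 1) {
  level       <- as.character(level)
  groups      <- split(y, level)
  global_mean <- mean(y)
  level_mean  <- vapply(groups, mean, numeric(1))  # mean response per level
  level_n     <- lengths(groups)                   # observations per level
  # Weight on the level mean: near 1 for well-populated levels, near 0 for rare ones.
  lambda  <- 1 / (1 + exp(-(level_n - k) / f))
  encoded <- lambda * level_mean + (1 - lambda) * global_mean
  list(per_level = encoded, per_row = unname(encoded[level]))
}

# Toy data (made up): one well-populated state and one singleton.
state      <- c("Texas", "Texas", "Texas", "Texas", "Idaho")
commission <- c(10, 20, 30, 40, 100)

enc <- target_encode(state, commission)
print(enc$per_level)
```

The singleton level is pulled strongly towards the global mean, while the well-populated level stays close to its own mean. In a real model the encoding should be learned on training data only, otherwise the response leaks into the covariate.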

Let us conduct informal data analysis for fh_commission vs destination_state.

There is a clear difference amongst the groups, but the data is not balanced and not equally spread. For an M v A(>2) type of problem we will perform a nonparametric Kruskal-Wallis test.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  fh_commission by destination_state
## Kruskal-Wallis chi-squared = 253.62, df = 46, p-value < 2.2e-16

The output (p-value < 2.2e-16) suggests that based on the sample evidence this is a statistically significant relationship.

12. destination_: postal_code; city; country

Considering the type of information those variables provide, it would not be wrong to assume that they are strongly associated with each other. As destination_state can be seen as a conglomerate of the other three, we will check their independence from it using the Chi-squared Test of Independence.

## postal_code:
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 46368, df = 9522, p-value < 2.2e-16
## city:
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 45047, df = 8372, p-value < 2.2e-16
## country:
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 1012, df = 46, p-value < 2.2e-16

The \(p\) values confirm our assumption that those variables are not independent of each other.
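The mechanics of these tests can be sketched on a toy example in which every state belongs to exactly one country. The data below are made up; only `table()` and `chisq.test()` from base R are used.

```r
# Toy data (made up): every state belongs to exactly one country.
state   <- rep(c("Texas", "Ohio", "Tamaulipas"), times = c(5, 3, 2))
country <- ifelse(state == "Tamaulipas", "Mexico", "United States")

# Contingency table of the two attribute variables.
tbl <- table(state, country)
print(tbl)

# Small expected counts trigger an approximation warning; it is silenced here
# because the toy table is only meant to show the mechanics.
res <- suppressWarnings(chisq.test(tbl))
print(res)
```

With two countries and complete nesting the statistic equals the number of observations (10 in the toy table), which is why the country test above reports X-squared = 1012 for 1,012 loads. Such a result reflects that one variable is fully determined by the other, not merely associated with it.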

13. origin_state

fh_commission vs origin_state: M v A(>2)

We will conduct the equivalent analysis to the one we did for fh_commission vs destination_state, for obvious reasons.

Yet again, there is a clear difference amongst the groups, but the data is not balanced and not equally spread. We will perform Kruskal-Wallis test.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  fh_commission by origin_state
## Kruskal-Wallis chi-squared = 155.37, df = 33, p-value < 2.2e-16

The output (p-value < 2.2e-16) suggests that based on the sample evidence this is a statistically significant relationship.

14. origin_: postal_code; city; country

To check for independence between origin_state and the three variables above, we will perform Chi-squared Test of Independence for all.

## postal_code:
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = NaN, df = 2937, p-value = NA

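The test returns NaN, most likely because some expected counts are zero: this happens when the contingency table contains rows or columns whose totals are zero (for example, states left with no usable postal code), so the statistic involves dividing zero by zero. The data is messy, and such empty rows or columns would need to be removed before the Chi-squared Test can be applied.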
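One plausible way to arrive at a NaN statistic is sketched below: a factor level that loses all of its observations (here through a missing postal code) leaves an all-zero row in the table. Whether this is what happened in the real data is an assumption; the toy values are made up.

```r
# Toy data (made up): one state has no usable postal code.
state  <- factor(c("Texas", "Texas", "Ohio", "Ohio", "Coahuila"))
postal <- c(78045, 78046, 45840, 45840, NA)

# table() drops the NA, so the Coahuila row contains only zeros.
tbl <- table(state, postal)
print(tbl)

# Zero row total -> zero expected counts -> 0/0 in the statistic.
bad <- suppressWarnings(chisq.test(tbl))
print(unname(bad$statistic))

# Dropping empty rows and columns gives a finite statistic again.
tbl_ok <- tbl[rowSums(tbl) > 0, colSums(tbl) > 0, drop = FALSE]
ok <- suppressWarnings(chisq.test(tbl_ok))
print(unname(ok$statistic))
```

Checking `rowSums()` and `colSums()` of the table before testing is a cheap way to spot this situation.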

## city:
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 31175, df = 2607, p-value < 2.2e-16
## country:
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 1012, df = 33, p-value < 2.2e-16

The calculated \(p\) values are confirming our assumption about those variables not being independent from each other.

15. carrier_name

fh_commission vs carrier_name: M v A(>2)

Let’s see how many levels this variable has.

## [1] 309

\(309\) different categories are too many for plotting on a boxplot and this is a variable that would have to be encoded for its application in the predictive model.

16. status

fh_commission vs status: M v A(>2)

##    at_stop  completed   go_to_pu in_transit      on_do    pending 
##          1        960          9         13          1          5 
##   unloaded 
##         23

Is there any point in considering this variable in relation to fh_commission, as the pricing would have been done before the status could be obtained?

17. is_multistop

fh_commission vs is_multistop: M v A(2)

##    Mode   FALSE    TRUE 
## logical     988      24

What does this variable represent? Could it influence fh_commission? Let us see how many have crossed the border:

18. is_cross_border

fh_commission vs is_cross_border: M v A(2)

##    Mode   FALSE    TRUE 
## logical     988      24

Should we expect this variable to be directly linked with destination_country?

##        Mexico United States 
##            50           962

If not, as the data suggests, why not?
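One way to investigate the mismatch is to compare the flag with a derived definition. The sketch below assumes that a load is cross-border when its origin and destination countries differ; that definition is a working assumption, not something confirmed by the data, and the toy rows are made up.

```r
# Toy data (made up) to illustrate the check; column names follow the data set.
toy <- data.frame(
  origin_country      = c("United States", "United States", "Mexico", "Mexico", "United States"),
  destination_country = c("United States", "Mexico", "Mexico", "United States", "United States"),
  is_cross_border     = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Working definition (assumed): a load crosses the border if the countries differ.
toy$countries_differ <- toy$origin_country != toy$destination_country

# Agreement between the recorded flag and the derived definition.
print(table(flag = toy$is_cross_border, derived = toy$countries_differ))

# Rows where the two disagree would need to be inspected.
mismatch <- which(toy$is_cross_border != toy$countries_differ)
print(mismatch)
```

Under this definition a load from Mexico to Mexico is not cross-border, which would be one possible reason for seeing more Mexican destinations than cross-border loads.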

19. is_completed

fh_commission vs is_completed: M v A(2)

This information is already provided through the variable status.

##    Mode   FALSE    TRUE 
## logical      52     960
## status:
##    at_stop  completed   go_to_pu in_transit      on_do    pending 
##          1        960          9         13          1          5 
##   unloaded 
##         23

Should we expect those variables to have influence on fh_commission?

20. is_dropped

fh_commission vs is_dropped: M v A(2)

##    Mode   FALSE    TRUE 
## logical     910     102

Same question as above: Is it reasonable to consider this variable as an explanatory variable of fh_commission?

21. shipper_name

fh_commission vs shipper_name: M v A(>2)

This is an attribute variable with over \(40\) levels:

## [1] 42

which is unbalanced. Nonetheless, let’s perform the Kruskal-Wallis test:

## 
##  Kruskal-Wallis rank sum test
## 
## data:  fh_commission by shipper_name
## Kruskal-Wallis chi-squared = 393.69, df = 41, p-value < 2.2e-16

The \(p\) value suggests this to be a significant relationship.

22. diffhourS

fh_commission vs diffhourS: M v M

Remember that we have derived diffhourS by calculating:

\(diffhourS = delivery\_scheduled\_until - delivery\_scheduled\_at\).

We will start by observing the spread of the diffhourS

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   4.474   8.000 108.000
## [1] 135 244 349

and look at the spread of the data once again without the extreme observations identified above.

Let us fit a regression model for fh_commission vs diffhourS.

## 
## Call:
## lm(formula = fh_commission ~ diffhourS, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2000.99  -150.99   -48.24   142.81  1103.73 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  250.991      8.404  29.867   <2e-16 ***
## diffhourS     -0.786      1.001  -0.785    0.433    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 226.2 on 1010 degrees of freedom
## Multiple R-squared:  0.00061,    Adjusted R-squared:  -0.0003795 
## F-statistic: 0.6164 on 1 and 1010 DF,  p-value: 0.4326

Even if we removed extreme values, the line of best fit would still be flat, suggesting that there is no significant relationship, as is confirmed by the large \(p\) value of \(.4326\).

23. diffhourSP

fh_commission vs diffhourSP: M v M

Remember: \(diffhourSP = delivery\_scheduled\_at - pickup\_scheduled\_at\)

Let us look at the spread of the variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   41.00   50.00   58.84   73.56  325.00

There are a few ‘large’ observations:

## [1]  46 204 439 454 519 819 820 821

None of them matches any of the observations we picked up earlier when examining the other variables.

Let us fit a regression model for fh_commission vs diffhourSP.

## 
## Call:
## lm(formula = fh_commission ~ diffhourSP, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1916.03  -127.78   -34.19   115.81  1087.71 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 140.5975    13.9114  10.107   <2e-16 ***
## diffhourSP    1.8163     0.2057   8.828   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218 on 1010 degrees of freedom
## Multiple R-squared:  0.07164,    Adjusted R-squared:  0.07072 
## F-statistic: 77.94 on 1 and 1010 DF,  p-value: < 2.2e-16

This is a statistically significant relationship in which \(7.16\%\) of the variability in fh_commission is explained by diffhourSP.

24. diffhourMP

fh_commission vs diffhourMP: M v M

\(diffhourMP = matched\_at - posted\_at\)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0056   0.0938   0.7417  11.5290   6.2753 393.8819

This is a right skewed distribution with a long tail to the right. We will not try to identify the observations in the right tail as there are too many. We will go on to fit a regression model.

## 
## Call:
## lm(formula = fh_commission ~ diffhourMP, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1995.34  -151.75   -51.42   139.79  1097.73 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 252.3457     7.5680   33.34   <2e-16 ***
## diffhourMP   -0.4225     0.2271   -1.86   0.0631 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 225.9 on 1010 degrees of freedom
## Multiple R-squared:  0.003415,   Adjusted R-squared:  0.002428 
## F-statistic: 3.461 on 1 and 1010 DF,  p-value: 0.06312

This relationship is not statistically significant at the \(5\%\) level: with \(0.05 < p < 0.1\) (\(p = 0.063\)) there is only weak evidence of a relationship.

Summary

We ultimately want to build a model:

\[ y = b_0 + b_1x_1 + b_2x_2 + ... + b_px_p + e\] where \(y\) is a response variable of interest, \(x_i\)’s (where \(i=1, 2, ..., p\)) are covariates, i.e. explanatory variables, and \(e\) is the error term. We have already recognised that our explanatory variables are not independent, and we should expect them to have a joint effect on some part of \(y\). There will be a relationship between \(x_i\) and \(y\) that can’t be distinguished from the relationship between \(x_j\) and \(y\), as \(x_i\) and \(x_j\) are not independent. The estimated relationship between \(x_i\) and \(y\) is then no longer the full effect of \(x_i\) on \(y\). It’s actually the marginal, unique effect of \(x_i\) on \(y\), after controlling for the effect of \(x_j\).

While the model fit as a whole will include both the joint and the unique effects of all \(x_i\)’s on \(y\), the regression coefficient for individual \(x_i\) will only include its unique effect.
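The distinction between joint and unique effects can be illustrated on simulated data in which one covariate is almost a copy of another, much like duration and distance. Everything in the sketch (variable names, coefficients, noise levels) is made up for illustration; the variance inflation factor is computed by hand as \(1/(1-R^2)\).

```r
set.seed(4)

# Simulated stand-in (made up): x2 is almost a copy of x1.
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 1 + 2 * x1 + rnorm(n)

# Each predictor on its own picks up the shared (joint) effect.
slope_x1_alone <- unname(coef(lm(y ~ x1))["x1"])
slope_x2_alone <- unname(coef(lm(y ~ x2))["x2"])

# Together, each coefficient reflects only the unique effect after controlling for the other.
fit_both <- lm(y ~ x1 + x2)

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2.
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

print(c(x1_alone = slope_x1_alone, x2_alone = slope_x2_alone))
print(coef(fit_both))
print(vif_x1)
```

A large variance inflation factor signals that the individual coefficients are poorly determined even when the model as a whole fits well, which is the usual argument for keeping only one of two near-duplicate covariates.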

We will start building the model using the available explanatory variables, but before we accept this as the best fitted model, we need to seek answers to the following questions:

  1. Do all of the explanatory variables collectively have an effect on the response variable? How big is \(R^2/R^2_{adj}\)? i.e. is this a valid model worth further investigation?

If the answer to the above question(s) is YES, next we need to assess:

  2. Individually, do the explanatory variables have an effect on the response variable? We need to find a model that explains as much of the variability in \(y\) as possible using only those \(x_i\)’s that, taken together, truly contribute to explaining the variability of \(y\).
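Both questions can be read off a fitted `lm` object. The sketch below uses simulated data in which one covariate matters and the other is pure noise; the data frame `toy` and all values are made up for illustration.

```r
set.seed(5)

# Simulated stand-in (made up): x1 matters, x2 is pure noise.
n   <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 250 + 40 * toy$x1 + rnorm(n, sd = 100)

fit <- lm(y ~ x1 + x2, data = toy)
s   <- summary(fit)

# Question 1: do the covariates collectively explain anything? (overall F-test, R^2)
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
print(c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared, overall_p = overall_p))

# Question 2: which covariates contribute individually? (t-test per coefficient)
print(s$coefficients)
```

The overall F-test addresses the first question, and the per-coefficient t-tests in the coefficient table address the second; a covariate whose t-test is not significant is a candidate for removal.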

Conclusions:

1. We have assumed that fh_commission is the key variable of interest:

\[fh\_commission = shipper\_closed\_price - carrier\_closed\_price.\]

Hence, the two variables shipper_closed_price and carrier_closed_price are directly related to the response variable:

  • they need to be known in order to obtain fh_commission.

As such, we will not consider them as covariates in the model.

fh_commission has a few extreme observations, one of which (observation no. \(203\)) has a very low negative value of \(-1,750.00\). Should we expect this to happen?

2. We have created three new variables:
  1. \(diffhourS = delivery\_scheduled\_until - delivery\_scheduled\_at\),
  2. \(diffhourSP = delivery\_scheduled\_at - pickup\_scheduled\_at\) and
  3. \(diffhourMP = matched\_at - posted\_at\),

which calculate varying time differences in hours.
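As a sketch of how such differences can be obtained, the example below uses the timestamps of the first observation as printed in the data overview. The time zone (`UTC`) is an assumption, since the report does not state which one the timestamps are recorded in.

```r
# Timestamps of the first observation; tz = "UTC" is an assumption
pickup_scheduled_at      <- as.POSIXct("2017-06-19 18:00:00", tz = "UTC")
delivery_scheduled_at    <- as.POSIXct("2017-06-22 12:00:00", tz = "UTC")
delivery_scheduled_until <- as.POSIXct("2017-06-22 15:00:00", tz = "UTC")

# difftime() with units = "hours" returns the gap in hours
diffhourS  <- as.numeric(difftime(delivery_scheduled_until, delivery_scheduled_at,
                                  units = "hours"))
diffhourSP <- as.numeric(difftime(delivery_scheduled_at, pickup_scheduled_at,
                                  units = "hours"))

c(diffhourS = diffhourS, diffhourSP = diffhourSP)
```

For this observation diffhourSP comes out at 66 hours, which agrees with the first value of diffhourSP in the modelling data.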

Is there any other time difference that should be observed and does it make sense for them to be used in the analysis?

3. We have discovered that many variables are highly correlated and overlap in the information that they contain.

Apart from the one already mentioned: \[fh\_commission = shipper\_closed\_price - carrier\_closed\_price,\]

we have also got the following:

  1. for destination:
  • destination_point
  • destination_point_lat
  • destination_point_long
  • destination_postal_code
  • destination_state
  • destination_city
  • destination_country
  2. for point of origin:
  • origin_point
  • origin_point_lat
  • origin_point_long
  • origin_postal_code
  • origin_state
  • origin_city
  • origin_country

Longitude and latitude points are not information that would be used in predictive modelling, but rather in graphical visualisation of the data.

Other ‘duplicated’ information is contained through the following variables:

  • distance and
  • duration

  • delivery_scheduled_at_month and
  • delivery_scheduled_at

  • status and
  • is_completed

There is a need to go through the rest of the variables to consider them carefully.

4. load_description is a very messy variable!!! 😣

This variable needs considerable work to make it suitable for modelling. This could be done using regular expressions to ‘standardise’ the given categories.
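One simple standardisation, sketched below, is to convert to upper case and strip everything that is not a letter. The first two raw values appear in the data overview; the other two are hypothetical spelling variants added for illustration. This rule is an assumption about what ‘standardise’ could mean here and would not merge genuine synonyms or misspellings.

```r
# "consumer goods" and "pipes " come from the data; the rest are hypothetical variants
raw <- c("consumer goods", "pipes ", "Consumer-Goods", " Pipes")

# Upper-case, then drop every character that is not A-Z
clean <- gsub("[^A-Z]", "", toupper(raw))

# Four raw spellings collapse into two categories
table(clean)
```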

5. Statistically significant relationships with fh_commission are found for the following variables:
  1. Measured type:
  • shipper_ask_price: weak, but nonetheless statistically significant with \(R^2 =1.36\%\);
  • distance: statistically valid relationship with \(R^2=11.91\%\);
  • duration: statistically valid relationship with \(R^2=10.32\%\);
  • diffhourSP: statistically valid relationship with \(R^2=7.16\%\);
  • diffhourMP: weak and only marginally significant relationship (\(p = 0.063\)) with \(R^2=0.34\%\).
  2. Attribute type:
  • scheduled_at: month: with a clear dip during the summer months;
  • delivery_scheduled_at: month: with a clear dip during the summer months;
  • pickup_scheduled_at: month: with a clear dip during the summer months;
  • load_equipment: \(p < 0.001\);
  • destination_state: \(p < 2.2 \times 10^{-16}\);
  • destination_: postal_code; city; country: all have \(p < 2.2 \times 10^{-16}\);
  • origin_state: \(p < 2.2 \times 10^{-16}\);
  • origin_: postal_code; city; country: postal_code is very unbalanced and we could not perform a statistical test; the other two have \(p < 2.2 \times 10^{-16}\);
  • shipper_name: \(p < 2.2 \times 10^{-16}\);
6. We have a few attribute variables of concern:
  • is_hazmat has only a few observations in one of its two categories making it very unbalanced for statistical modelling;
  • carrier_name has over \(300\) levels, i.e. categories;
  • status: the pricing would have been done before the status could be obtained?
  • is_multistop: same information as is_cross_border
  • is_cross_border: should it be linked to destination_country?
  • is_completed: after event to the ‘pricing’?!;
  • is_dropped: ‘after event’?!
7. Variables with too many missing values:
  • max_bid: \(93.77\%\)
  • min_bid: \(93.77\%\)
  • no_bids_refused: \(93.77\%\)
  • unloaded_at: \(42.59 \%\)
  • carrier_winning_bid: \(98.42\%\)

We would need to think carefully about the best way of approaching them.
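Percentages of this kind can be computed for every column at once. The sketch below uses a small made-up data frame (`toy`), and the 40% cut-off for flagging a variable as too sparse is an arbitrary assumption, not a rule taken from this report.

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  max_bid     = c(NA, NA, NA, 1200),
  unloaded_at = c("2017-06-22", NA, "2017-07-10", NA),
  distance    = c(1780.3, 1517.9, 2177.3, 950.0)
)

# Share of missing values per column, in percent
pct_missing <- round(100 * colMeans(is.na(toy)), 2)
sort(pct_missing, decreasing = TRUE)

# Flag columns above an (assumed) 40% threshold
too_sparse <- names(pct_missing)[pct_missing > 40]
too_sparse
```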

8. Variables deemed as not informative for modelling purposes:
  • primary_tracking_source: has only one possible outcome
  • id: same as system index number
  • shipper_id: same as shipper_name
  • shipment_no: same as system index number
  • destination and origin _postal_code provide the same info as destination and origin state.
9. Why are destination_country and is_cross_border not showing the same information?
## 
##        Mexico United States 
##            50           962
## 
## FALSE  TRUE 
##   988    24
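A cross-tabulation of the two variables, together with a flag derived from the origin and destination countries, would show which rows disagree. The sketch below uses a hypothetical four-row data frame. Defining a cross-border load as one whose origin and destination countries differ is our assumption, not a confirmed business rule.

```r
# Hypothetical rows covering the four origin/destination combinations
toy <- data.frame(
  destination_country = c("United States", "Mexico", "Mexico", "United States"),
  origin_country      = c("United States", "United States", "Mexico", "Mexico"),
  is_cross_border     = c(FALSE, TRUE, FALSE, FALSE)
)

# How the recorded flag lines up with the destination country
xt <- table(toy$destination_country, toy$is_cross_border)
xt

# Assumed definition: a load is cross-border when the two countries differ
toy$derived_cross_border <- toy$origin_country != toy$destination_country

# Rows where the recorded flag disagrees with the derived one
mismatch <- toy[toy$derived_cross_border != toy$is_cross_border, ]
mismatch
```

Under this definition a load from Mexico to Mexico is not cross-border, which would be one possible reason why the number of Mexican destinations exceeds the number of TRUE flags.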
10. How is the variable distance calculated?

Shipping distances can be calculated using the Google Distance Matrix API: https://developers.google.com/maps/documentation/distance-matrix/get-api-key

# google api for calculating google maps distances
library(gmapsdistance)

# test <- gmapsdistance(origin = from, 
#                      destination = to,
#                      combinations = "pairwise",
#                      key = "YOURAPIKEYHERE",
#                      mode = "walking")

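# build plain "lat+long" strings, as in the single-observation example further down
# (no literal quote characters embedded in the values)
dp <- paste0(destination_point_lat, '+', destination_point_long)
op <- paste0(origin_point_lat, '+', origin_point_long)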

results = gmapsdistance(origin = op, 
                        destination = dp, 
                        mode = "driving")

# convert results$Distance from m into miles
udunits2::ud.convert(results$Distance, "m", "mi")
#
# ------------ or with:
library(googleway)
#
# test <- google_distance(origins = from,
#                        destinations = to,
#                        mode = "walking",
#                        key="YOURAPIKEYHERE")
# -----------------
#
# Interesting to see (for example the 1st observation)

results = gmapsdistance(origin = "27.617417+-99.523012", 
                        destination = "38.8645105+-76.7279378",                     
                        mode = "driving")
                        
results$Distance <- udunits2::ud.convert(results$Distance, "m", "mi")

distance[1]
[1] 1780.323
duration[1]
[1] 97020

results
$Time
[1] 93651

# There is a small discrepancy

$Distance
[1] 1774.97

$Status
[1] "OK"

Who performs these calculations (distance and duration), and when is this information collected?

Proposed Model

the response variable:

  • \(y = fh\_commision\)

the explanatory variables:

  • \(x_1 = shipper\_ask\_price\),
  • \(x_2 = distance\),
  • \(x_3 = duration\)
  • \(x_4 = diffhourSP\),
  • \(x_5 = diffhourMP\),
  • \(x_6 = delivery\_scheduled\_until\:month\),
  • \(x_7 = delivery\_scheduled\_at\:month\),
  • \(x_8 = pickup\_scheduled\_at\:month\),
  • \(x_9 = load\_description\), !!!!!
  • \(x_{10} = load\_equipment\),
  • \(x_{11} = destination\_state\),
  • \(x_{12} = destination\_city\),
  • \(x_{13} = destination\_country\),
  • \(x_{14} = origin\_state\)
  • \(x_{15} = origin\_city\)
  • \(x_{16} = origin\_country\)
  • \(x_{17} = shipper\_name\)
  • \(x_{18} = is\_cross\_border\)

Thus, we have:

\[ y = b_0 +b_1x_1 + b_2x_2 + ... + b_{18}x_{18} \]

The majority of the variables are of attribute type and, as we have seen earlier in the report when observing them individually and in relation to fh_commission, many of them are correlated, providing the same type of information.

Another point that we need to consider is that many variables are high-cardinality categorical attribute variables, which will need to be encoded when used in the model fitting procedure.
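One common way to make a high-cardinality factor manageable is to lump rare levels into a single category before dummy encoding. The sketch below is illustrative only: `lump_rare` is a hypothetical helper, the threshold `min_n` is arbitrary, and the short city vector is made up from city names that appear in the data.

```r
# Hypothetical helper: replace levels seen fewer than min_n times with "Other"
lump_rare <- function(x, min_n = 2, other = "Other") {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_n]
  x[x %in% rare] <- other
  factor(x)
}

# Made-up vector of cities
city <- c("Laredo", "Laredo", "Laredo", "Findlay", "Spokane", "Findlay")
city_lumped <- lump_rare(city, min_n = 2)
levels(city_lumped)

# Dummy (treatment) encoding as used internally by lm() and glm()
X <- model.matrix(~ city_lumped)
colnames(X)
```

Lumping reduces the number of coefficients to estimate, at the cost of treating all rare levels as interchangeable.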

Model Fitting

We will fit a model for illustrative purposes, as we are not sure if the variable fh_commission is the key variable of interest, and some of the variables we wish to consider for our model are messy and correlated with each other. Hence, we will simplify the set of explanatory variables we wish to use.

This is the data we will use for our model:

## Observations: 1,012
## Variables: 19
## $ fh_commission                  <dbl> 875.00, 625.00, 1075.00, 600.00...
## $ shipper_ask_price              <dbl> 4375, 3125, 5375, 4000, 2625, 1...
## $ distance                       <dbl> 1780.3230, 1517.9297, 2177.3311...
## $ duration                       <int> 97020, 95100, 136440, 102360, 8...
## $ delivery_scheduled_at_month    <fct> 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, ...
## $ load_description               <fct> CONSUMERGOODS, PIPES, BUILDINGM...
## $ load_equipment                 <fct> Dry Van, Dry Van, Flatbed, Flat...
## $ destination_state              <fct> Maryland, Ohio, Washington, Ohi...
## $ destination_city               <fct> Upper Marlboro, Findlay, Spokan...
## $ destination_country            <fct> United States, United States, U...
## $ origin_state                   <fct> Texas, Texas, Texas, Texas, Ill...
## $ origin_city                    <fct> Laredo, Laredo, Laredo, Laredo,...
## $ origin_country                 <fct> United States, United States, U...
## $ shipper_name                   <fct> Ventus Freight LLC, Ventus Frei...
## $ diffhourSP                     <dbl> 66.0, 66.0, 92.0, 116.0, 90.0, ...
## $ diffhourMP                     <dbl> 0.100555556, 0.047777778, 0.035...
## $ cross_border                   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ delivery_scheduled_until_month <fct> 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, ...
## $ pickup_scheduled_at_month      <fct> 6, 7, 7, 7, 6, 8, 8, 8, 8, 10, ...

Remember, there are a number of “messy” variables, which we will ignore for the purpose of just showing the modelling procedure and to illustrate the need for tidying them up.
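The general shape of the procedure (split the data, fit a full model, select terms stepwise by AIC, predict) can be sketched as follows. This is not the code that produced the output in this report: it runs on simulated data (`md_toy`), uses base R's `step()` rather than any particular package, and fits on the training rows while predicting on the held-out rows, which is an assumption about the intended workflow.

```r
# Hypothetical data: y depends on x1 and on level "b" of g; x2 and x3 are noise
set.seed(3)
n <- 300
md_toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
md_toy$y <- 10 + 3 * md_toy$x1 + 2 * (md_toy$g == "b") + rnorm(n)

# 80/20 split into training and held-out rows
train <- sample(seq_len(n), size = 0.8 * n)

# Full Gaussian model on the training rows, then stepwise selection by AIC
full <- glm(y ~ x1 + x2 + x3 + g, data = md_toy[train, ])
sel  <- step(full, direction = "both", trace = 0)

formula(sel)   # selected model
sel$anova      # path of terms dropped or added

# Out-of-sample check on the held-out rows
pred <- predict(sel, newdata = md_toy[-train, ])
rmse <- sqrt(mean((md_toy$y[-train] - pred)^2))
rmse
```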

## Start:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + origin_state + origin_city + origin_country + 
##     shipper_name + diffhourSP + diffhourMP + cross_border + delivery_scheduled_until_month + 
##     pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + origin_state + origin_city + origin_country + 
##     shipper_name + diffhourSP + diffhourMP + cross_border + pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + origin_state + origin_city + origin_country + 
##     shipper_name + diffhourSP + diffhourMP + pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + origin_state + origin_city + shipper_name + 
##     diffhourSP + diffhourMP + pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + origin_state + shipper_name + diffhourSP + 
##     diffhourMP + pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + shipper_name + diffhourSP + diffhourMP + 
##     pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     shipper_name + diffhourSP + diffhourMP + pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_city + shipper_name + 
##     diffhourSP + diffhourMP + pickup_scheduled_at_month
## 
## 
## Step:  AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + destination_city + shipper_name + diffhourSP + 
##     diffhourMP + pickup_scheduled_at_month
## 
##                               Df Deviance    AIC
## - delivery_scheduled_at_month  5  1077521 2571.2
## - pickup_scheduled_at_month    4  1076738 2573.1
## - diffhourSP                   1  1064695 2576.8
## - shipper_name                 2  1080064 2577.7
## - diffhourMP                   1  1071350 2578.1
## <none>                            1064200 2578.7
## - distance                     1  1076778 2579.1
## - shipper_ask_price            1  1085898 2580.8
## - duration                     1  1091566 2581.8
## - load_description            13  1308824 2594.7
## - destination_city            46  2881812 2688.9
## 
## Step:  AIC=2571.21
## fh_commission ~ shipper_ask_price + distance + duration + load_description + 
##     destination_city + shipper_name + diffhourSP + diffhourMP + 
##     pickup_scheduled_at_month
## 
##                                  Df Deviance    AIC
## - diffhourSP                      1  1077521 2569.2
## - shipper_name                    2  1091433 2569.8
## - diffhourMP                      1  1083235 2570.3
## - pickup_scheduled_at_month       9  1174739 2570.8
## - distance                        1  1087063 2571.0
## <none>                               1077521 2571.2
## - shipper_ask_price               1  1099394 2573.3
## - duration                        1  1108184 2574.9
## + delivery_scheduled_at_month     5  1064200 2578.7
## + delivery_scheduled_until_month  5  1064200 2578.7
## - load_description               13  1331715 2588.2
## - destination_city               46  3028979 2689.0
## 
## Step:  AIC=2569.21
## fh_commission ~ shipper_ask_price + distance + duration + load_description + 
##     destination_city + shipper_name + diffhourMP + pickup_scheduled_at_month
## 
##                                  Df Deviance    AIC
## - shipper_name                    2  1091628 2567.8
## - diffhourMP                      1  1083347 2568.3
## - distance                        1  1087073 2569.0
## - pickup_scheduled_at_month       9  1176950 2569.1
## <none>                               1077521 2569.2
## + diffhourSP                      1  1077521 2571.2
## - shipper_ask_price               1  1099399 2571.3
## - duration                        1  1108239 2572.9
## + delivery_scheduled_at_month     5  1064695 2576.8
## + delivery_scheduled_until_month  5  1064695 2576.8
## - load_description               13  1334018 2586.6
## - destination_city               46  3074487 2690.1
## 
## Step:  AIC=2567.85
## fh_commission ~ shipper_ask_price + distance + duration + load_description + 
##     destination_city + diffhourMP + pickup_scheduled_at_month
## 
##                                  Df Deviance    AIC
## - diffhourMP                      1  1096958 2566.8
## - distance                        1  1098254 2567.1
## <none>                               1091628 2567.8
## + destination_state               1  1083803 2568.4
## - pickup_scheduled_at_month       9  1199281 2568.9
## + shipper_name                    2  1077521 2569.2
## - shipper_ask_price               1  1112865 2569.8
## + diffhourSP                      1  1091433 2569.8
## - duration                        1  1124328 2571.8
## + delivery_scheduled_at_month     5  1080068 2575.7
## + delivery_scheduled_until_month  5  1080068 2575.7
## - load_description               30  2247425 2654.4
## - destination_city               52  3216467 2683.2
## 
## Step:  AIC=2566.84
## fh_commission ~ shipper_ask_price + distance + duration + load_description + 
##     destination_city + pickup_scheduled_at_month
## 
##                                  Df Deviance    AIC
## - distance                        1  1103429 2566.0
## <none>                               1096958 2566.8
## + destination_state               1  1089297 2567.4
## + diffhourMP                      1  1091628 2567.8
## + shipper_name                    2  1083347 2568.3
## + diffhourSP                      1  1096379 2568.7
## - shipper_ask_price               1  1118286 2568.8
## - pickup_scheduled_at_month       9  1217665 2570.0
## - duration                        1  1129149 2570.7
## + delivery_scheduled_at_month     5  1086270 2574.9
## + delivery_scheduled_until_month  5  1086270 2574.9
## - load_description               30  2250577 2652.7
## - destination_city               52  3230649 2682.1
## 
## Step:  AIC=2566.04
## fh_commission ~ shipper_ask_price + duration + load_description + 
##     destination_city + pickup_scheduled_at_month
## 
##                                  Df Deviance    AIC
## <none>                               1103429 2566.0
## + distance                        1  1096958 2566.8
## + destination_state               1  1097764 2567.0
## + diffhourMP                      1  1098254 2567.1
## - shipper_ask_price               1  1123728 2567.7
## + diffhourSP                      1  1103000 2568.0
## + shipper_name                    2  1092657 2568.1
## - pickup_scheduled_at_month       9  1233113 2570.6
## - duration                        1  1150892 2572.6
## + delivery_scheduled_at_month     5  1095296 2574.5
## + delivery_scheduled_until_month  5  1095296 2574.5
## - load_description               30  2255878 2651.2
## - destination_city               52  3235810 2680.4
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month + 
##     load_description + load_equipment + destination_state + destination_city + 
##     destination_country + origin_state + origin_city + origin_country + 
##     shipper_name + diffhourSP + diffhourMP + cross_border + delivery_scheduled_until_month + 
##     pickup_scheduled_at_month
## 
## Final Model:
## fh_commission ~ shipper_ask_price + duration + load_description + 
##     destination_city + pickup_scheduled_at_month
## 
## 
##                                Step Df     Deviance Resid. Df Resid. Dev
## 1                                                          72    1064200
## 2  - delivery_scheduled_until_month  0 0.000000e+00        72    1064200
## 3                    - cross_border  0 0.000000e+00        72    1064200
## 4                  - origin_country  0 0.000000e+00        72    1064200
## 5                     - origin_city  0 0.000000e+00        72    1064200
## 6                    - origin_state  0 0.000000e+00        72    1064200
## 7             - destination_country  0 0.000000e+00        72    1064200
## 8               - destination_state  0 0.000000e+00        72    1064200
## 9                  - load_equipment  0 2.328306e-10        72    1064200
## 10    - delivery_scheduled_at_month  5 1.332102e+04        77    1077521
## 11                     - diffhourSP  1 3.237737e-01        78    1077521
## 12                   - shipper_name  2 1.410728e+04        80    1091628
## 13                     - diffhourMP  1 5.329806e+03        81    1096958
## 14                       - distance  1 6.470938e+03        82    1103429
##         AIC
## 1  2578.688
## 2  2578.688
## 3  2578.688
## 4  2578.688
## 5  2578.688
## 6  2578.688
## 7  2578.688
## 8  2578.688
## 9  2578.688
## 10 2571.213
## 11 2569.213
## 12 2567.854
## 13 2566.843
## 14 2566.037
## 
## Call:
## glm(formula = fh_commission ~ shipper_ask_price + duration + 
##     load_description + destination_city + pickup_scheduled_at_month, 
##     data = md[-train, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -212.76   -10.85     0.00    15.27   463.71  
## 
## Coefficients: (26 not defined because of singularities)
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -6.463e+02  4.386e+02  -1.473 0.144474    
## shipper_ask_price            5.922e-03  4.821e-03   1.228 0.222882    
## duration                     1.839e-02  9.791e-03   1.878 0.063926 .  
## load_description1           -2.152e+02  2.055e+02  -1.047 0.298115    
## load_description2           -4.293e+01  2.074e+02  -0.207 0.836542    
## load_description3           -7.479e+02  3.364e+02  -2.223 0.028939 *  
## load_description4           -7.730e+02  4.760e+02  -1.624 0.108266    
## load_description5            3.259e+02  3.239e+02   1.006 0.317287    
## load_description6           -7.126e+02  4.253e+02  -1.675 0.097646 .  
## load_description7           -8.718e+02  4.629e+02  -1.883 0.063190 .  
## load_description8           -6.280e+02  2.769e+02  -2.268 0.025955 *  
## load_description9            4.473e+02  3.127e+02   1.430 0.156464    
## load_description10           2.753e+02  4.408e+02   0.625 0.533966    
## load_description11          -8.336e+02  4.501e+02  -1.852 0.067600 .  
## load_description12          -1.442e+03  1.060e+03  -1.360 0.177428    
## load_description13          -1.370e+03  4.564e+02  -3.002 0.003552 ** 
## load_description14           4.840e+01  5.949e+02   0.081 0.935361    
## load_description15          -9.477e+02  4.297e+02  -2.205 0.030232 *  
## load_description16          -9.477e+02  4.297e+02  -2.205 0.030232 *  
## load_description17           2.213e+02  5.455e+02   0.406 0.685991    
## load_description18          -2.099e+02  3.507e+02  -0.598 0.551158    
## load_description19           4.854e+02  2.712e+02   1.790 0.077170 .  
## load_description20           1.905e+02  1.633e+02   1.167 0.246729    
## load_description21          -1.174e+03  5.519e+02  -2.128 0.036378 *  
## load_description22          -7.336e+02  3.802e+02  -1.930 0.057115 .  
## load_description23           3.898e+02  2.352e+02   1.658 0.101197    
## load_description24           7.249e+02  5.569e+02   1.302 0.196690    
## load_description25          -6.414e+02  2.643e+02  -2.427 0.017434 *  
## load_description26           3.218e+02  2.493e+02   1.291 0.200435    
## load_description27          -9.828e+02  4.665e+02  -2.107 0.038172 *  
## load_description28          -3.755e+02  5.062e+02  -0.742 0.460266    
## load_description29          -1.691e+01  1.353e+02  -0.125 0.900834    
## load_description30           8.835e+02  3.685e+02   2.397 0.018785 *  
## load_description31          -4.435e+02  6.799e+02  -0.652 0.516019    
## load_description32           3.003e+02  4.228e+02   0.710 0.479459    
## load_description33           1.444e+03  8.769e+02   1.647 0.103488    
## load_description34           6.166e+01  5.800e+02   0.106 0.915590    
## load_description35          -7.019e+02  4.987e+02  -1.408 0.163044    
## load_description36          -2.105e+02  3.015e+02  -0.698 0.487074    
## load_description37          -1.036e+03  3.987e+02  -2.599 0.011081 *  
## load_description38           3.641e+02  4.361e+02   0.835 0.406151    
## load_description39          -5.161e+02  4.269e+02  -1.209 0.230157    
## load_description40          -1.172e+03  6.500e+02  -1.804 0.074936 .  
## load_description41          -9.583e+02  6.770e+02  -1.415 0.160712    
## load_description42          -1.041e+03  5.471e+02  -1.903 0.060512 .  
## load_description43          -1.945e+03  1.212e+03  -1.604 0.112535    
## load_description44          -8.739e+02  5.495e+02  -1.590 0.115616    
## load_description45          -1.232e+03  5.541e+02  -2.223 0.028939 *  
## load_description46          -3.171e+02  1.885e+02  -1.682 0.096343 .  
## load_description47           4.324e+02  3.481e+02   1.242 0.217629    
## load_description48           4.184e+02  3.395e+02   1.232 0.221384    
## load_description49          -7.212e+02  2.110e+02  -3.418 0.000986 ***
## load_description50           1.896e+04  1.115e+04   1.700 0.092963 .  
## load_description51           3.686e+02  4.197e+02   0.878 0.382410    
## load_description52           1.440e+03  8.692e+02   1.656 0.101496    
## load_description53           1.474e+02  3.252e+02   0.453 0.651468    
## load_description54           2.280e+02  3.275e+02   0.696 0.488307    
## load_description55          -1.311e+03  6.567e+02  -1.996 0.049223 *  
## load_description56          -1.583e+03  6.579e+02  -2.406 0.018354 *  
## destination_city1            7.716e+02  4.276e+02   1.804 0.074830 .  
## destination_city2            4.613e+02  3.054e+02   1.511 0.134728    
## destination_city3            1.274e+03  6.749e+02   1.888 0.062629 .  
## destination_city4            6.039e+02  3.477e+02   1.737 0.086146 .  
## destination_city5            1.752e+03  1.322e+03   1.325 0.188860    
## destination_city6           -1.914e+04  1.108e+04  -1.727 0.087870 .  
## destination_city7            5.035e+02  2.514e+02   2.003 0.048495 *  
## destination_city8                   NA         NA      NA       NA    
## destination_city9           -1.989e+02  2.435e+02  -0.817 0.416469    
## destination_city10                  NA         NA      NA       NA    
## destination_city11           1.261e+03  4.617e+02   2.732 0.007711 ** 
## destination_city12          -5.866e+02  4.392e+02  -1.335 0.185434    
## destination_city13          -1.437e+02  2.306e+02  -0.623 0.535056    
## destination_city14                  NA         NA      NA       NA    
## destination_city15                  NA         NA      NA       NA    
## destination_city16          -5.440e+02  3.563e+02  -1.527 0.130699    
## destination_city17           4.266e+02  2.312e+02   1.845 0.068634 .  
## destination_city18          -1.101e+02  1.594e+02  -0.691 0.491603    
## destination_city19          -7.488e+02  6.461e+02  -1.159 0.249785    
## destination_city20           2.074e+02  2.316e+02   0.895 0.373223    
## destination_city21           6.473e+02  3.726e+02   1.737 0.086063 .  
## destination_city22           6.525e+02  3.682e+02   1.772 0.080101 .  
## destination_city23           1.658e+02  1.604e+02   1.034 0.304205    
## destination_city24                  NA         NA      NA       NA    
## destination_city25                  NA         NA      NA       NA    
## destination_city26          -1.413e+03  8.151e+02  -1.734 0.086744 .  
## destination_city27           5.492e+02  2.998e+02   1.832 0.070616 .  
## destination_city28                  NA         NA      NA       NA    
## destination_city29           1.883e+02  2.511e+02   0.750 0.455403    
## destination_city30                  NA         NA      NA       NA    
## destination_city31           1.002e+03  5.719e+02   1.752 0.083514 .  
## destination_city32           2.544e+02  1.656e+02   1.537 0.128254    
## destination_city33           1.426e+03  8.453e+02   1.687 0.095390 .  
## destination_city34           1.631e+03  1.264e+03   1.290 0.200578    
## destination_city35           3.914e+02  1.965e+02   1.992 0.049711 *  
## destination_city36           3.948e+02  7.518e+02   0.525 0.600921    
## destination_city37           1.484e+02  2.389e+02   0.621 0.536132    
## destination_city38           9.378e+02  4.988e+02   1.880 0.063665 .  
## destination_city39                  NA         NA      NA       NA    
## destination_city40           3.984e+02  2.771e+02   1.438 0.154373    
## destination_city41           1.364e+02  1.791e+02   0.761 0.448609    
## destination_city42          -2.786e+01  1.912e+02  -0.146 0.884491    
## destination_city43                  NA         NA      NA       NA    
## destination_city44                  NA         NA      NA       NA    
## destination_city45                  NA         NA      NA       NA    
## destination_city46           6.856e+02  3.919e+02   1.749 0.083998 .  
## destination_city47           1.111e+02  1.505e+02   0.738 0.462561    
## destination_city48           9.588e+02  5.605e+02   1.710 0.090961 .  
## destination_city49           1.497e+02  1.887e+02   0.793 0.430007    
## destination_city50           2.349e+02  1.414e+02   1.662 0.100344    
## destination_city51                  NA         NA      NA       NA    
## destination_city52                  NA         NA      NA       NA    
## destination_city53           3.098e+02  1.919e+02   1.614 0.110356    
## destination_city54           8.079e+02  3.962e+02   2.039 0.044667 *  
## destination_city55                  NA         NA      NA       NA    
## destination_city56           1.583e+03  8.606e+02   1.840 0.069425 .  
## destination_city57           9.667e+02  4.326e+02   2.235 0.028146 *  
## destination_city58          -2.899e+02  5.729e+02  -0.506 0.614243    
## destination_city59                  NA         NA      NA       NA    
## destination_city60           2.001e+02  2.442e+02   0.819 0.415078    
## destination_city61                  NA         NA      NA       NA    
## destination_city62           3.298e+02  2.182e+02   1.512 0.134450    
## destination_city63           5.195e+01  6.292e+02   0.083 0.934397    
## destination_city64          -3.925e+02  3.099e+02  -1.267 0.208897    
## destination_city65          -9.606e+02  4.848e+02  -1.982 0.050876 .  
## destination_city66                  NA         NA      NA       NA    
## destination_city67          -1.592e+02  1.978e+02  -0.805 0.423311    
## destination_city68          -6.315e+01  1.696e+02  -0.372 0.710607    
## destination_city69                  NA         NA      NA       NA    
## destination_city70           1.247e+03  6.424e+02   1.941 0.055643 .  
## destination_city71           1.604e+02  1.621e+02   0.989 0.325496    
## destination_city72                  NA         NA      NA       NA    
## destination_city73           5.890e+02  3.334e+02   1.767 0.081006 .  
## destination_city74                  NA         NA      NA       NA    
## destination_city75                  NA         NA      NA       NA    
## destination_city76                  NA         NA      NA       NA    
## destination_city77                  NA         NA      NA       NA    
## pickup_scheduled_at_month1   1.204e+02  5.513e+01   2.183 0.031863 *  
## pickup_scheduled_at_month2   4.330e+01  6.260e+01   0.692 0.491046    
## pickup_scheduled_at_month3   4.358e+01  7.160e+01   0.609 0.544413    
## pickup_scheduled_at_month4   7.580e+01  7.356e+01   1.031 0.305775    
## pickup_scheduled_at_month5   1.611e+01  8.058e+01   0.200 0.842031    
## pickup_scheduled_at_month6  -3.642e+02  3.864e+02  -0.943 0.348603    
## pickup_scheduled_at_month7          NA         NA      NA       NA    
## pickup_scheduled_at_month8   2.731e+01  7.715e+01   0.354 0.724311    
## pickup_scheduled_at_month9   1.165e+01  5.842e+01   0.199 0.842413    
## pickup_scheduled_at_month10 -2.592e+01  5.088e+01  -0.509 0.611772    
## pickup_scheduled_at_month11         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13456.45)
## 
##     Null deviance: 9110366  on 202  degrees of freedom
## Residual deviance: 1103429  on  82  degrees of freedom
## AIC: 2566
## 
## Number of Fisher Scoring iterations: 2
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
##     round.pred..digits...2. round.md.fh_commission.test...digits...2.
## 11                   100.00                                    100.00
## 16                   544.58                                    400.00
## 20                   536.29                                   1000.00
## 28                   186.06                                    150.00
## 29                   132.03                                    300.00
## 35                   277.21                                    300.00
## 47                   700.00                                    700.00
## 59                   100.00                                    100.00
## 70                   223.97                                    200.00
## 74                   257.35                                    250.00
## 76                    50.00                                     50.00
## 77                   567.19                                    600.00
## 80                   269.48                                    250.00
## 85                   195.34                                    200.00
## 90                   157.96                                    200.00
## 92                   393.30                                    600.00
## 97                   114.36                                      0.00
## 99                   334.15                                    200.00
## 107                  200.00                                    200.00
## 109                  620.94                                    700.01
## 113                  253.66                                    150.00
## 114                   41.42                                     28.00
## 129                  233.16                                    388.50
## 170                  200.00                                    200.00
## 183                 -300.00                                   -300.00
## 199                  180.99                                    200.00
## 206                  250.00                                    250.00
## 210                   61.11                                     50.00
## 217                 1030.00                                   1030.00
## 229                  250.30                                    300.00
## 238                  556.16                                    600.00
## 252                  144.20                                    100.00
## 263                  249.70                                    200.00
## 266                  450.00                                    500.00
## 268                  528.85                                    700.00
## 278                  245.77                                     50.00
## 280                  544.58                                    425.00
## 282                  254.59                                    250.00
## 286                  200.00                                    200.00
## 287                  150.00                                    150.00
## 289                 -143.00                                   -143.00
## 296                  185.96                                    200.00
## 300                  250.00                                    250.00
## 301                  253.66                                    500.00
## 308                  273.00                                    273.00
## 314                  212.50                                    212.50
## 316                  550.00                                    550.00
## 318                  186.00                                    186.00
## 320                  106.50                                    106.50
## 323                  586.56                                    600.00
## 330                  281.13                                    300.00
## 344                   50.00                                     50.00
## 346                  200.00                                    200.00
## 360                  395.00                                    395.00
## 361                   72.02                                    100.00
## 363                  412.76                                    200.00
## 378                   25.00                                     25.00
## 383                  309.06                                    309.06
## 385                  368.54                                    161.84
## 388                  553.40                                    400.01
## 395                  133.64                                     50.00
## 396                  250.00                                    250.00
## 401                  300.00                                    300.00
## 405                  240.00                                    240.00
## 409                  183.50                                    200.00
## 419                  542.37                                    600.00
## 442                  210.01                                      0.00
## 455                 -100.00                                   -100.00
## 456                  450.00                                    400.00
## 457                   50.00                                     50.00
## 459                  100.00                                    100.00
## 460                  379.19                                    400.00
## 461                  379.19                                    500.00
## 462                  300.01                                    200.00
## 465                  153.56                                    150.00
## 470                   42.80                                    100.00
## 471                 -100.00                                   -100.00
## 475                  550.00                                    550.00
## 480                  609.00                                    609.00
## 491                  207.48                                    200.00
## 495                  390.17                                    250.00
## 497                  254.60                                    300.00
## 498                  420.47                                    350.00
## 505                  528.85                                    700.00
## 513                  112.04                                    100.00
## 520                  198.70                                    100.00
## 523                  565.00                                    565.00
## 525                   37.58                                    100.00
## 527                  500.00                                    500.00
## 529                  205.00                                    150.00
## 530                  343.55                                    343.55
## 540                  190.49                                    300.00
## 541                    0.00                                      0.00
## 543                   95.00                                     95.00
## 545                   95.00                                     95.00
## 550                  386.20                                    400.00
## 558                  679.85                                    650.00
## 562                  700.00                                    700.00
## 564                  262.59                                    300.00
## 565                  157.96                                    200.00
## 572                  157.96                                      0.00
## 575                  300.00                                    300.00
## 580                  233.05                                    200.00
## 584                  253.66                                    100.00
## 585                  300.00                                    300.00
## 590                  253.91                                    200.00
## 601                  300.00                                    300.00
## 605                  233.06                                    250.00
## 606                  142.06                                    170.00
## 617                    0.00                                      0.00
## 623                  233.06                                    350.00
## 626                   69.88                                     71.40
## 628                  189.63                                    189.63
## 630                  484.50                                    484.50
## 631                  103.16                                     71.40
## 638                  173.92                                    150.00
## 639                  544.58                                    500.00
## 641                  233.16                                    200.00
## 642                  544.58                                    425.00
## 646                  500.00                                    500.00
## 649                  102.66                                    212.25
## 657                  200.00                                    200.00
## 658                  300.01                                    250.00
## 660                   30.00                                     30.00
## 676                  200.00                                    200.00
## 680                  231.60                                    300.00
## 681                  200.00                                    200.00
## 685                  200.83                                     50.00
## 692                    0.00                                      0.00
## 698                    0.00                                      0.00
## 701                   39.78                                    100.00
## 705                  287.50                                    287.50
## 706                  132.13                                    100.00
## 713                  178.06                                    150.00
## 715                   92.00                                     92.00
## 727                   60.68                                     28.00
## 728                  200.00                                    200.00
## 736                  157.08                                    157.08
## 739                  222.79                                    200.00
## 744                  230.60                                    250.00
## 749                  439.36                                    450.00
## 751                  201.30                                    300.00
## 755                  640.11                                    600.00
## 756                  500.00                                    500.00
## 758                  148.90                                    100.00
## 764                  322.72                                    400.00
## 771                  659.89                                    700.00
## 774                  355.80                                    400.00
## 775                  310.59                                    300.00
## 777                  265.85                                    400.00
## 778                  276.03                                    300.00
## 784                 -130.00                                   -130.00
## 785                  100.00                                    100.00
## 787                  114.36                                    100.00
## 790                   50.00                                     50.00
## 791                  150.00                                    150.00
## 793                 -150.00                                   -150.00
## 800                  150.00                                    150.00
## 803                   39.78                                    100.00
## 804                  200.00                                    200.00
## 806                  233.06                                    250.00
## 811                  100.00                                    100.00
## 813                  142.06                                    150.00
## 830                  185.58                                    104.72
## 831                   69.88                                     71.40
## 835                  600.00                                    600.00
## 838                  816.30                                    816.30
## 839                  816.30                                    816.30
## 841                  423.60                                    423.60
## 842                  300.00                                    300.00
## 845                  222.19                                    100.00
## 848                  200.00                                    200.00
## 852                  171.62                                    300.00
## 863                   54.82                                    150.00
## 865                  229.53                                    300.00
## 869                  114.36                                      0.00
## 873                    0.00                                      0.00
## 875                   50.00                                     50.00
## 877                    0.00                                      0.00
## 902                  205.00                                    260.00
## 904                  142.57                                    150.00
## 905                   39.78                                    100.00
## 909                    0.00                                      0.00
## 914                  142.57                                    100.00
## 920                  210.00                                    210.00
## 924                  620.15                                    650.00
## 925                  118.37                                    150.00
## 926                   60.00                                     60.00
## 929                  500.00                                    500.00
## 930                   75.00                                     75.00
## 937                  390.92                                    400.00
## 938                  300.00                                    300.00
## 949                  151.10                                    200.00
## 964                  100.00                                    100.00
## 965                  150.00                                    150.00
## 975                  164.86                                    200.00
## 977                  409.08                                    400.00
## 978                  247.22                                    200.00
## 981                  131.72                                    100.00
## 983                   80.89                                     71.40
## 986                   40.32                                     52.36
## 992                    3.04                                      3.04
## 998                  608.80                                    500.00
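
The table above is hard to judge by eye, so a summary error measure helps. The sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) for the first five rows of that table, taking the first column as the prediction and the second as the observed test-set `fh_commission`. The data frame `cmp` and its column names are illustrative. On the full test set the same two lines would be applied to the complete prediction and response vectors.

```r
# First five rows of the comparison table above (rows 11, 16, 20, 28, 29).
cmp <- data.frame(
  predicted = c(100.00, 544.58, 536.29, 186.06, 132.03),
  actual    = c(100.00, 400.00, 1000.00, 150.00, 300.00),
  row.names = c("11", "16", "20", "28", "29")
)

err  <- cmp$predicted - cmp$actual   # signed prediction error per load
mae  <- mean(abs(err))               # mean absolute error
rmse <- sqrt(mean(err^2))            # root mean squared error

# Share of rows where prediction and actual agree to the printed two decimals.
exact_share <- mean(abs(err) < 0.005)

print(c(MAE = mae, RMSE = rmse, exact_share = exact_share))
```

RMSE penalizes large misses such as row 20 (536.29 predicted against 1000.00 observed) more heavily than MAE, so reporting both gives a sense of how much the error is driven by a few outliers. Many rows in the table match exactly, which may reflect near-duplicate loads in the training and test sets and is worth checking before trusting the fit.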